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Abstract 

Inference with population genetic data usually treats the population pedigree as a nuisance parameter, 
the unobserved product of a past history of random mating. However, the history of genetic relationships 
in a given population is a fixed, unobserved object, and so an alternative approach is to treat this network 
of relationships as a complex object we wish to learn about, by observing how genomes have been noisily 
passed down through it. This paper explores this point of view, showing how to translate questions about 
population genetic data into calculations with a Poisson process of mutations on all ancestral genomes. 
This method is applied to give a robust interpretation to the f i statistic used to identify admixture, and to 
design a new statistic that measures covariances in mean times to most recent common ancestor between 
two pairs of sequences. The method more generally interprets population genetic statistics in terms 
of sums of specific functions over ancestral genomes, thereby providing concrete, broadly interpretable 
interpretations for these statistics. This provides a method for describing demographic history without 
simplified demographic models. More generally, it brings into focus the population pedigree, which is 
averaged over in model-based demographic inference. 


Introduction 

For most of the history of the field, geneticists made great progress with data that comprise a tiny fraction 
of all that is in the genome. Even the hundreds of thousands of markers on modern genotyping chips make 
certain inferences impossible due to ascertainment of these markers. Today, whole-genome sequence is being 
used to make striking new discoveries about health and human history [Haak et al., 2015]; such data will 
soon be commonplace as well in nonmodel organisms. The scale and unbiased nature of these data opens up 
new opportunities, but ■will also require new methods, as many assumptions made previously may no longer 
be necessary. 

To do this, it is helpful to begin with the process that generated these data. The rules by which the 
majority of genetic material is inherited by diploid, sexually reproducing organisms are simple: each new 
individual carries two distinct copies of the genome - one from mom, one from dad - and each of these 
copies is formed by recombining randomly chosen parts of the parent’s two copies so as to make a whole. 
The inherited copies may be modified by mutation, which provides the raw material not only for evolution, 
but also the means for us to learn about those genomes’ histories. 

The network describing parent-offspring relationships between past and present members of a population 
is known as the population pedigree', adding an encoding of the genetic relationships - which parts of each 
genome was inherited from which copy of the parental genomes - produces the ancestral recombination graph 
[ARG, described by Griffiths and Marjoram, 1997]. Supposing that any particular position in one of those 
copies of the genome was inherited from a single one of the two ancestral copies - for the most part true - 
then at each point on the genome these genetic relationships form a tree (known as the gene tree) that is 
embedded within the larger pedigree. Just as the population pedigree can be thought of as constraining the 
collection of gene trees, so the ancestral recombination graph constrains the patterns of concordance and 
discordance of genetic variants created by mutation within each population. 

Inference with population genetic data usually works by specifying a population model - e.g., a randomly 
mating population of fixed size N - that determines a probability distribution on the population pedigree, and 
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then assuming that mutation and recombination occurs independently within this [Ewens, 2004]. This paper 
deals with the intermediate layer - if we take the population pedigree, or even the ancestral recombination 
graph, as fixed, then what can we learn about it? What are the statistics of population genetics telling us 
about it? This is partly a matter of philosophy: should the population pedigree that has actually occurred 
be treated as an instantiation of a random process, with the goal of inferring parameters of that process? 
Or, is it better to treat the population pedigree as a nonrandom but complex and partially observed object 
that we wish to estimate summary statistics of? The former view may be most appropriate if we have one or 
several well-defined demographic models we believe are close to the truth, but the latter is useful for making 
robust inferences in the absence of any realistic demographic model, as well as for identifying whether we 
have power to distinguish such models. 

The approach of this paper is simple: Assuming that mutations occur independently of demography and 
recombination (and are hence neutral), these can be treated as an inhomogeneous Poisson process on the set 
of all ancestral genomes. Many population genetics statistics can be written as integrals of specific functions 
against this Poisson process. By treating the ancestral recombination graph as fixed, the means and variances 
of these statistics can be found in terms of integrals of the same functions against the mutation rate, and can 
therefore be interpreted as descriptive statistics of the unknown, but real, ancestral recombination graph. 
Stochastic models of population genetics generally include demography, recombination, and mutation: this 
paper takes the outcomes of the first two as fixed, dealing only with the randomness of mutation. More 
information is obtainable by also treating recombination as random, but it is useful to begin with only 
mutation, especially as most commonly used population genetics statistics do not model recombination (e.g., 
the site frequency spectrum). The addition of recombination will be described in a companion paper. 

The remainder of this paper is structured as follows. After a review of some of the literature, two main 
results are presented in non-technical terms, given for their inherent interest as well as examples of the 
general method of calculation that is presented in the subsequent section. Next come proofs of the main 
results, and some more material, followed by some more discussion of the results and future work. 

Background 

Perhaps the main point of this paper is to see what can be learned about the population pedigree, or 
more properly the ARG, by taking an empirical point of view, i.e. treating it as a fixed, but unknown 
object. Slatkin [1991] interprets probabilities of identity by descent and Fst in terms of genealogies; the 
discussion is framed in terms of randomly mating populations, but much of the discussion is applicable 
without thinking of the genealogies as random. McVean [2002, 2009] continued in this vein, interpreting two 
common population genetics statistics, linkage disequilibrium and the principal components decomposition, 
in terms of the realized genealogy of the population. Wakeley et al. [2012] begins from a similar philisophical 
point, that the pedigree should be treated as a fixed parameter, and goes on to show that data generated 
under a fixed pedigree differ very little from those generated under a standard coalescent model (which does 
not incorporate correlations due to a fixed pedigree). 

Formal methods describing expectations of gene transmission in a known pedigree have long been used 
by plant and animal breeders. In particular, Wright [1922] ’s method of “path coefficients” treated the 
pedigree was known and averaged over recombination and segregation. These methods are related to a 
parallel issue arising in genetic association studies [reviewed by Speed and Balding, 2014] where the pedigree 
is a confounding factor, and alternative approaches can be seen as estimating the actual degree of genetic 
relatedness or as averaging over it. 

The empirical point of view for genealogies is much more natural in the context of phylogenetics, where 
between most groups of species, all loci have nearly the same genealogical relationship, and the goal is 
to directly infer this common tree. For instance, Gillespie and Langley [1979] pointed out that variation 
in coalescence time contributed to variance in between-species divergence, and Edwards and Beerli [2000] 
further partition the variance in divergence into components induced by the substitution process and by 
the demographic process. Pluzhnikov and Donnelly [1996] also described, in a model of random mating, the 
variance of Watterson’s and Tajima’s estimates of 6 expected along the genome as a function of recombination 
rate. Similarly, studies of nonrecombining loci often treat the demography as an unknown parameter. For 
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instance, Tang et al. [2002] took the frequentist approach to estimating time to most recent common ancestor 
(TMRCA) for modern human Y-chromosomes. Hudson [2007] continued this work, finding the variance of 
Thomson et al. [2000]’s TMRCA estimator (which is based on the density of segregating derived mutations). 

Early examples of using whole-genome data to estimate an empirical quantity of the population pedigree 
were the inferences of gene flow between diverged species by Kulathinal et al. [2009] (between Drosophila 
species), and by Green et al. [2010] (Neanderthal to modern humans). A wider family of similar statistics were 
subsequently developed in the human population genetics literature [e.g., Durand et al., 2011, Patterson et al., 
2012], and have formed the basis for many new insights into human history and evolution. However, their 
behavior under complex, non-deme-based models of population structure is thus far unclear. Understanding 
these in a more general context was a major impetus for the present work. 


Main results 

The purpose of this paper is to outline a general method of calculation, the utility of which is best commu¬ 
nicated with examples. First, a few assumptions. I assume that at each site there are no more than two 
variants seen (the proportion of triallelic sites within populations is extremely small, but see Jenkins and 
Song [2011]); and that at each site a reference allele has been chosen; “allele frequencies” are frequencies of 
the other, alternate allele. However, all statistics developed here are invariant under relabeling of alleles. 

I also make the standing assumption of low per base pair mutation rate - precisely, the infinite sites 
model of mutation [Watterson, 1975], under which each new mutation occurs at a distinct site, and is hence 
observable. This assumption is quite common in population genetics, where the typical time separating 
samples is small enough to make this assumption a good one (but I return to this below). 

As a first example, consider the / 4 statistic [Patterson et al., 2012]: if X, Y, U, and V are four sets of 
sampled genomes, and px {£) is the allele frequency among the samples in X at site £, etcetera, then the / 4 
statistic of these samples is 

1 G 

U(X,Y;U,V) = - -PYW)(Pu(t) -PvW) (1) 

u i -i 

where the sum is over some set of G sites. Notice that / 4 (X, Y; £/, V) = —/ 4 (Y, X; [/, U). This implies 
that if the relationship of X and Y to U and V is symmetric, then the expected value of the statistic is 
zero. In particular, it is zero under a tree-based model of populations, in which X and Y diverged from a 
common ancestral population more recently than their common ancestors with U or V, and there was no 
subsequent gene flow from U or V. Green et al. [2010] and Reich et al. [2010] used these statistics to provide 
evidence for gene flow from archaic hominids into modern humans. For example, taking X to be a sample of 
subsaharan Africans, Y a sample of western Europeans, U a Neanderthal, and V a chimpanzee, a significantly 
negative value implies that European allele frequencies are more correlated with Neanderthal than expected 
under a model of no admixture; and given archaeological evicence, this supports interbreeding between 
Neanderthals and modern humans after leaving Africa. However, it is not at first clear how the statistic 
should perform under more realistic population histories (e.g. asymmetric gene flow between subsaharan 
Africa and Meditteranean populations). The results below provide a model-free interpretation: 

Theorem 1 (/ 4 statistic). Let X, Y, U, and V be collections of nx, ny, njj, and ny chromosomes, 
respectively, that have been genotyped at G sites. (I) For each position I in the genome and each ancestral 
individual’s chromosome, a, let Fx(a,i) denote the proportion of the samples in X that have inherited from 
ancestor a at position l, and similarly for Fy, Fjj, and Fy. Then 

1 G 

E[/ 4 (X, Y; U,V)\ = - ££>(*) (F A -M) - Fy{a, £))(Fu(a, £) - Fy(a,£)), (2) 

i-1 a 

where p(t) is the per-generation mutation rate at site l, and the sum over a is the sum over all ancestral 
chromosomes. (II) Equivalently, for any set of samples (x,y,u,v), let St XtU y<y, v \(£) denote the length of the 
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branch between the most recent common ancestor of x and u and the most recent common ancestor of y and 
v (which may be zero). Then 


E lf 4 (X,Y-,U,V)} = - 1 - £ (£)), (3) 

nxnynunv ^ G " 

where the sum is over all choices ofx€X,y€Y,u€U, and v € V. Furthermore, 

1 1 G 

vai[f A {X,Y-,U,V)\< - 7*2 ^v{Z){S(x,u)ty,v){t) +S(x, v y,( v , u ){!-))- (4) 

Tlx n Y n U n V — Ct — 
u v x,y,u,v £—1 

Note that Fx{a,l ) is unknown , and so equation (2) interprets f±{X,Y\U,V) as a descriptive statistic of 
demographic history. Once formulated using the general-purpose theory, the proof of this comes down to 
straightforward calculations with Poisson processes, which are carried out below. 

As a second example, the theory can be used to design a statistic, to estimate the covariance in times 
to most recent common ancestor between two pairs of chromosomes. Concretely, let (aq,aq) and ( 2 / 1 , J/ 2 ) be 
two pairs of sampled chromosomes, and for each position £ in the genome, let t x (£) denote the total number 
of ancestors from whom either aq or aq, but not both, have inherited at £, and likewise for t y (£). (So, t x (£) 
is the length of the tree, in number of meioses, between aq and aq at £.) Suppose the genome is divided into 
nonoverlapping pieces { W \,..., W n }, and we wish to estimate covariances of mean values of t x and t y , scaled 
by mutation rate, across these chunks. (The scaling by mutation rate could be removed if it was constant 
across sites, or known, but I am leaving it in for realism.) In other words, we want the covariance between 
litk, x = jypy[ 12eew k where yi is the mutation rate at site £, and the corresponding quantity for y. 

For convenience, also let /it x = (1/G) Yt=i hetx(£) denote the mean time to common ancestor across the 
entire genome, scaled by mutation rate; and likewise for yt k , v and yt v . 

Theorem 2 (Empirical covariance of mean coalescence times). Let N/-(x) denote the number of mutations 
that differentiate aq and aq in window Wk, Nk{y) the same for yi and y- 2 , and let Nk(x,y) denote the 
intersection of those two sets, i.e., the number of mutations that differentiate between aq and aq as well as 
between z/i and y 2 . Define Zjk = Nj{x)Nk{y)/\Wj\\Wk\ for j k, and Zjj = (Nfix) 2 — Nj(x,y))/\Wj\ 2 . 
The statistic 
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estimates the covariance in mean mutation-scaled coalescent times: 


E [C X J 



k —1 


with variance 


var [C XtV \ < 


n\W\ ’ 


(5) 


( 6 ) 

(7) 


if e = maxfc{max(/aifc )X , ytk, y )} and \W\ = max fc \W k \. 

A formal calculus of mutations on the ARG 

Genealogy Ambitiously, the object of study is the collection of genetic relationships between every indi¬ 
vidual ever in the population, past and present. To simplify the discussion of relationships between diploids, 
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consider relationships on only a single chromosome, and let A denote the collection of all copies of this chro¬ 
mosome in any individual alive at any point back to some arbitrarily distant time. The population pedigree, 
denoted V, is the directed acyclic graph with vertex set A that has an edge a —> b if a is a parent of b, so that 
each vertex in A has in-degree equal to two. (Although this is a different object than the pedigree relating 
the diploid individuals, rather than their constituent chromosomes, the two are simple transformations of 
each other.) Under the standing assumption of the infinite sites model, no two mutations fall in the same 
location, and so locations on the chromosome are indexed by the continuous interval Q = [0, G\, where G is 
the total length of the chromosome, in nucleotides. At each site in Q all ancestral chromosomes are related 
by a tree, embedded in the population pedigree, and this collection of trees is equivalent to the ARG. 

Mutation Under the infinite sites model, the mutation process is Poisson: concretely, the sites at which 
ancestral chromosome a £ A differs from their parent (i.e., has a mutation) are distributed as the points 
of an inhomogeneous Poisson process on Q. This Poisson process of mutations has intensity measure given 
by the function p, where p{£) is the mutation rate at £. To study the inheritance of mutations through 
the pedigree, these are next collected into a Poisson process on all ancestral chromosomes: formally, M is 
the point measure on A x Q formed by putting one point mass at each mutation. Equivalently, M counts 
mutations: for any region of the chromosome [i,j\ C Q and any set of ancestral chromosomes A C A, 
the value M{A x [i,j]) is the total number of mutations that occurred in the region [i,j\ on any of the 
chromosomes in A. 

Under the assumption that mutation occurs independently of the pedigree, the expected number of 
mutations occurring in any collection of pieces of ancestral chromosomes is simply the sum of p(£) across 
these pieces. Formally, M's mean measure has density p with respect to n<S>d£, where n is counting measure 
on A and d£ is Lesbegue measure on the chromosome. Below, I will abuse notation slightly and write dp for 
this measure, so that for any (measurable) subset of ancestral genomes S C A x Q, 

E[M(S)] = f dp(a,£). (8) 

Js 

The key simplifying assumption of the infinite sites model is that every mutation is observable: if two 
ancestral chromosomes a and b have both inherited a region of genome [i,j] from their most recent common 
ancestor on this region, then given the genomes of a and b on this segment, we also know the number of 
mutations that have occurred in this segment in any of the ancestors going back to the common ancestor. 

Additive statistics The formulation of mutations as a point measure on ancestral genomes is designed 
to make a certain set of statistics easy to formulate and analyze, namely, those that can be formed using 
integrals of test functions against the mutation process. Recall that since M is a point measure on the 
set of all ancestral genomes, A x Q, with unit mass at each ancestral mutation, then if y is a function on 
A x Q, the integral of y against M is the sum of the values of y at the locations of all mutations: defining 
{iria,k : 1 < k < n a } to be the locations of the mutations that occurred on ancestral chromosome a, 


/ n a 

x{a,£)dM(a,£) = EE x(a,m ajfc ). (9) 

k—1 

It is easier to write this as simply J y dM. Here is a simple example (for which all this notation is unnecessary): 

Example 1 (Number of mutations). Given a sample of chromosomes A C A let £ a{cl,£) = 1 if a £ A and 
Ca(o-,£) = 0 otherwise. Then J C,AdM counts the number of mutations appearing de novo in the sampled 
chromosomes. 

This is a simple prototype for what we would like to do more generally. But, notice that f C,AdM is 
unobservable given only the genomes in A - to identify new mutations, we need the parental genomes as 
well. Statistics of this form work by first, based on the samples, constructing a function y on ancestral 
genomes, and then adding up the value that y takes at the location of each mutation on any ancestral 
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x <a>- y\ 



Figure 1: A cartoon of values taken by the path between (left) two sampled chromosomes x and y , 
and (right) two groups of samples, X = {x\,X 2 }, and Y = {y\, t/ 2 , 2 / 3 }- The path between two single 
chromosomes is, marginally at each locus, the indicator of the path from each back to their common ancestor; 
but between larger groups the values it takes depends on the local topology of the tree relating the samples. 


genome. To be observable, such statistics will need to satisfy certain conditions: in particular, y must be 
zero for any piece of ancestral genome not inherited by any sampled genome, or inherited by all of them. 
This idea of observability under the infinite sites model is encoded by the following set of special functions: 

Definition 1 (Paths). (I) For any two chromosomes x and y in A, the path from x to y is the function 

, .. . I 1 ifxory, but not both, inherits from a at £ , . 

[x<^y\(a,t) = < (10) 

0 otherwise. 


(II) For any two collections of chromosomes X and Y, having |X| and \Y\ elements each, the path from X 
to Y is the function 


[X o Y}(a,i) 


1 

C(X,Y) 


^ [a; o y\{a,£), 

x,y 


(ii) 


where the sum is over distinct pairs x € X and y GY with x 7 ^ y, and C(X, Y) is the number of such pairs. 
These are depicted in figure 1. 

In other words, the path [x y] is the indicator function of the pieces of ancestral chromosomes that are 
inherited by x or y, but not both. At each locus £, the function [x y] (•, £) on A is the indicator function of 
the tree connecting x and y at l (but notably excluding the root). [X o Y] is the average of these indicators, 
which gives the proportion of paths between pairs of samples from the two groups that pass through each 
ancestor in the pedigree. 

Example 2 (Sequence divergence). The sequence divergence between two chromosomes x and y, denoted 
7r(x,y), is the mean density of distinguishing mutations; it is 

n{x,y) = ^ J [x y]dM. ( 12 ) 
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For a group of samples X we define the average pairwise divergence similarly: 


AX) 


1 


m(m-i) 


^2 7T (x 1 ,X 2 ) 


X\^X2^X 


l G J ([X»X])dM. 


(13) 

(14) 


Tajima [1983] proposed tv(X) as an estimator for 9 = 4 N e p, under the model of a large, randomly mating 
population of diploids with constant mutation rate p and effective population size N e . Indeed, E[7 t(X)] = 6 , 
but the variance of 7r(X) for a nonrecombining locus does not go to zero as the number of samples grows, 
and so this is regarded as a poor estimator of 9. However, it is clear that the variance does go to zero if 
calculated with the genealogy fixed, i.e., thought of as an estimator of the mean time to most recent common 
ancestor (multiplied by twice the mutation rate). Denoting by fit the empirical mean mutation-rate-scaled 
time to most recent common ancestor, the variance can be partitioned (as in Edwards and Beerli [2000]): 
var[7r(X)] = E[var[7r(X) | pt}] + var[E[7r(X) | pt]]. The first term here goes to zero as the size of the sample 
increases, but the second, due to randomness in the demographic process, does not. The question of whether 
7f(X) is a good estimator, and for what, becomes a question of whether we want to think of it as estimating 
a summary statistic of actual relationships or a parameter in a certain stochastic model of demography. 


Observable statistics For application to data, it only makes sense to consider statistics that can be 
obtained by comparing the genomes of a given set of chromosomes. As discussed above, under the infinite 
sites model this means that the genome sequence of a pair of samples x and y on subset of the genome L is 
determined by the number of mutations inherited by x or y but not both on L, or Yla^A Sfei \- x ^ v\( a A)- 
If 1 l(o,,£) = 1 for t G L and is zero otherwise, this is M(1l[x O y}), so the set of observable statistics, 
given a collection of samples A , is generated by the set of linear combinations of functions of this form, i.e., 
products of indicators of a segment of genome with a path between two samples. It is clear that we cannot 
learn about parts of the pedigree from which no samples have inherited, or about anything occurring longer 
ago than the most recent common ancestor of the samples at each site, but it is not clear how to formulate 
more generally what information about the ARG is or is not obtainable from finite samples of chromosomes. 

This notion of observability is made formal by the following definition: 

Definition 2 (Observable statistics). A function x( a A) on Ax Q is observable given a collection of chro¬ 
mosomes A if it is in the algebra generated by functions of the form l\ UtV ](a,€)[x y]{a,€), where [u,v] is 

an interval of the chromosome, and x and y are samples in A. 

There are further, unavoidable, limitations to what it is possible to learn from the data. For instance, the 
sequence divergence tt(x, y) between x and y is an estimate of the mean of the empirical distribution of times 
to most recent common ancestors (TMRCA) between x and y, multiplied by the mutation rate. Theorem 2 
gives an estimator of the variance of the mean TMRCA in windows. However, it is not possible to estimate the 
variance of the distribution of single-site TMRCA values without somehow modeling dependencies between 
sites induced by recombination. An easy way to see this is if each site has at most one mutation, then 
without using inter-site dependencies, the model is equivalent to one where the probability of a segregating 
mutation is constant, equal to the mean. To make further progress, it is necessary (and reasonable) to make 
further assumptions, that are beyond the scope of this paper. 


Moments 

Calculations with statistics in this formalism can be made with the help of the following general formula for 
integrals of test functions against a Poisson processes: 
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Lemma 1. [Generating function] For any test function 4> : A x Q —► K. for which f (j>dp is absolutely 
convergent, the statistic f <pdM is well-defined, and 


E 



= exp 


(/ ^ ~ ^ dl " ) ' 


(15) 


Proof. This is Campbell’s Theorem; see e.g. Kingman [1993]. Recall that J 4>dM is the sum of the value 
that cj) takes over all mutations occuring in M\ “well-defined” means that this sum is absolutely convergent 
even if there are an infinite number of mutations. □ 


We immediately get, by differentiating this formula, the moments: 

Lemma 2 (Mean and covariance). Let (j) and ^ be test functions on A x Q for which f ([dp, f ([dp, and 
J (fnfdp are all absolutely convergent. Then 


cov 




(16) 

(17) 


Proof. These are a standard calculations, but we carry out the second for covariance for concreteness. Let 
F(a, b) = exp (f (afi + bip)dM[. Using (15), 


<9 Q <9 6 logE [F(a,6)] | a=b=0 = J (jyipd/a 


On the other hand, exchanging the expectation and the derivatives, 

E [d a d b F(a, b)} - E [■ d a F(a , b )] E [d b F(a, 6)] 


d a d b \ogE [F{a,b)} a=b=0 = 


E [F(a,b) 2 


a—6—0 


= E 


cfipdM 


-E 


( i>dM 


E 


ifdM 


(18) 

(19) 

( 20 ) 


□ 


Expressions for more general moments, or statistics whose expected values match desired quantities, can 
be found using the following general formula: 


Lemma 3. [General Moments] Let S be the set of upper triangular n x n matrices with entries in {0,1} 
and whose columns sum to 1 (i.e. ways of partitioning the set {1, 2,... ,n} into sets), and for a G S let |er| 
denote the number of nonzero rows of a (i.e. the number of sets in the partition). Then for any collection 
of test functions (f>i ,..., (f> n on Ax Q for which the following integrals against p are absolutely convergent, 


E 


II / 4>idM 


= eii / nc 


( 21 ) 


cr£<S i 


and 


P n 

j=l crSS 



dM 


( 22 ) 


Note: methods for efficient computation of mixed moments using orthogonal polynomials are presented 
in a general framework by Peccati and Taqqu [2011]. 

Proof. By induction, by differentiating either the generating function (for the first formula) or the log of the 
generating function (for the second) of d 



Application of the method 

We can now use the formalism developed above to easily prove the theorems given in the Introduction. 

Proof of Theorem 1. Write G x n for the genotype of haplotype x at position £, coded as ‘0’ for the reference 
allele and ‘1’ for the alternate allele. First note that 

1 G 1 

h(X, Y-U,V) = -Y j - V (G x t - G yi )(G ut - G vt ), (23) 

G n x n Y nun v ^ 

i.e., fi is the average product of differences between pairs of haplotypes chosen from ( X , Y) and from (U, V), 
averaging over site in the genome and choices of pairs (x,y) and (u,v). For a given quadruple (x,y,u,v), 
sites will add to this sum if there is a mutation shared by x and u that y and v do not carry, or vice versa; 
and sites will subtract from the sum if there is a mutation that is shared by x and v that y and u do not 
carry, or vice versa. The theorem for f±(x, y ; u, v) then follows easily by writing the statistic as the difference 
of two Poisson random variables, divided by G. To carry this through using the notation here, note that the 
function \x <-»• v] [y u] — [x u] [y v] takes the value +1 on the central branch of trees with unrooted 
topology ((x, u), (y, v)), —1 on the central branch of trees with unrooted topology ((x, v), (y, u)), and is zero 
on the remaining topology ((x, y), (u,v)); therefore, 

U(X, Y ; U, V) = ([X -B- P] [F H [/] — [I H U}[Y O V ]). (24) 

The expected value of / 4 is therefore just the integral of [X o V][Y o U] — [X o U][Y o V\ against /q 
using linearity this can be rewritten in various ways to give the two interpretations given in the theorem. 
Since 


^ v][y #]) (S( x ,y);( u , v ){^) T S{x,u)\(y,v) W) , (25) 

expression (3) follows immediately. 

For the second interpretation, define * to be all ancestral genomes alive at some point in the remote past 
(longer ago than the maximum time to most recent common ancestor across the genome), and note that 
[x O it] = | * | 2 [* -H- x] [* u]. (The factor of | ★ | is only here to cancel the denominator in the definition of 

the path function.) Therefore, [ x O v\ [y O u] — [x O u] [ye»]=|* | 2 ([* O x] — [* <-)• y])([* <-)• u] — [* O u]). 
Since q(| * |[* x])(a,£) is equal to 1 if a is more recent than * and x has inerited from a at £, averaging 

over choices of x gives the function Fx defined in the theorem: | * |/z([* o X])(a, i) = / j,(£)F x (a ,£), so that 
equation (2) follows. Note that in this interpretation, each term in the product is not summable on its own 
(e.g., Y^ a Fx(a,£) = oo), but cancellation and a reasonable assumption about finiteness of recent common 
ancestors makes differences like F x — F Y summable. 

Finally, consider the variance. By theorem 2, 

var[/ 4 (X, Y ; U, V)] = ^ ({[X o P][F *+ U] - [X o U)[Y o P]} 2 ) . (26) 

By Jensen’s inequality, 

([X o V][Y o U] - [X o U][Y o P]) 2 <- 1 - V ([xod[y*+d-[x*+d[yod) 2 . (27) 

nxUYnuny ' 
x,y,u,v 

Since the summand is equal to {S^ xu yi yv \{£) — S^ xv yr y ^(£)), summed over £ this gives equation (4). 

□ 

Theorem 2 describes how to estimate the covariance in empirical mean coalescent times between two 
pairs of sampled chromosomes. Since the probability that two samples are heterozygous (i.e., differ) at a 
site is proportional to their TMRCA at that site (to first order), it is natural to suggest the covariance of 
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heterozygosities as an estimator of at least a simliar quantity. This would be arguably more natural from the 
population genetics point of view, as mean heterozygosities are an easily observed statistic. The following is 
therefore a useful contrast to theorem 2 . 

First, some more notation. Let (xi,x 2 ) and {y-\. y 2 ) be two pairs of sampled chromosomes, and for each 
position £ in the genome, recall that t x (£) is the total number of ancestors in the tree leading from x\ and x 2 
back to their common ancestor at £ (not including the common ancestor), and likewise for t y (£). We will also 
need t x n y (£), the number of ancestors in the intersection of these two trees. Just as fit x was defined above 
to be the mean time to common ancestor, scaled by mutation rate; so also pt x n y = (1 /G) Yht=i Tetxny{^)- 

Theorem 3 (Empirical covariance of heterozygosities). Let 9 X {£) = 1 if the genotypes of x 1 and X 2 differ 
at £, and 9 X {£) = 0 otherwise, and let 9 X = (1 /G)'f2,^_ 1 9 x {£) he the mean heterozygosity of (, X\,X 2 ); and 
likewise for 9 y (£) and 9 y . Then 

Q _ Y / ' - - __ 

E Y 0 x (t)0y(£) - 0 x 0y = titxny - iJLtxiity (28) 

t= 1 

and 

G 1 

var =^ t *n„W + 0(l/G 3 ). (29) 

t=i J 

Note that since pt x <C 1, the value of (28) is positive. 

Proof of Theorem 3. Under the infinite-sites model, 9 X {£) and 9 y {£) are both nonzero only if there is a 
mutation at £ that falls on both the path from X\ to X 2 and the path from y\ to y 2 , and so 

5>(/)*v(*) = [ i x 1 0 x 2 ]{a,£)[y- L y 2 \(a,£)dM(a,£). (30) 

t ^ 

To reduce the amount of ink below, for the meantime let <j> x = [x\ <-> X 2 ], 4> y = [ 3/1 y 2 ], and for a test 

function ip write M(ip) = J ipdM and p(ip) = J ip dp. In this shorthand, the statistic is 

^ E WWv(t) - My = ~^~M((p x (p y ) - ±M(<p x )M(<j> y ). (31) 

1=1 

Using Lemma 3, E [M(<t> x <f> y )\ = p(<p x <j) y ) and E [M((p x )M(<p y )\ = p((p x <p y ) + p(<p x )p(<p y ), which combine 
to give 

E M((p x (p v ) --^M((p x )M((j)y) = ^p{<t> x <t> y ) - ^p{(p x )p{(p y ), (32) 

which is equation (28). 

Again using Lemma 3, and the fact that <p x (a,£) 2 = <p x (a,£), var[M (<p x <p y )\ = p(<p x (p y ) and 

co \[M{(p x (p y ), M{(p x )M{(p y )\ = p{(j> x <t>y) (1 + p{(p x ) + p{<P y )) (33) 

and 

var [M(<p x )M(<p y )\ = p(<p x <p y ) (1 + 2(p((p x ) + p(<P y )) + 2 p(<p x <j> y ) + 4 p(<p x ) p(f> y )) 
so finally, equation (29) is 1/G 4 multiplied by 

(G- l) 2 var [M(<p x <j) v )] - 2 (G - l)co v[M(<f> x <j> y ),M(<j) x )M(<)) y )] +var [M(<p x )M((py)\ 

= h(<t>x<j>y) { (G — l ) 2 - 2 (G — 1)(1 + p((p x ) + p{<j>y)) 

+ 1 + 2 (p((p X ) + p((py)) + 2 p(f> x (py) + 4 p((f> X ) p(<Py)} . 


(34) 

(35) 

(36) 
□ 
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We now turn to the proof of Theorem 2, which is similar to the previous proof, but somewhat more 
involved. 


Proof of Theorem 2. Here, the goal is to design a statistic that estimates the covariance in mean coalescent 
times (where time is always scaled by mutation rate). Therefore, if we define 


fjk,x (n, £) 


WP\[ x i ,x 2 ](a,e) 


lUe W k 
otherwise 


(37) 


then i-i('il>k,x) = /it k,x gives the mean coalescent time on window Wk- We next need something whose 
expectation is ni‘ipk,x)hi'i' l l ) k,y)- By Lemma 3, this is 

E [M{ll)k, X )M{tpk,y) - M(tjjk,x1pk,y)\ = V(i’k,x)lJ'(lpk,y)- (38) 

Similarly, if we define f) x (a,£) = ^ Yfik=i V’fc,a;( a )'0> so that fi(ipx) = h-tx, we need something whose expecta¬ 
tion is which by the same lemma is 

E [Af(^)M(V> y ) - M(f> x 4>y)\ = n(fi) x )niffy). (39) 

These two combine to get 

i n 

C x , y = - {M(^ k ,x)M(^k,y) - M(^ ktX ^ kl y)} - {M(il> x )M(i/>y) - M^x'ipy)} . (40) 

1 k =1 


Now note that by linearity of M , 


M{ip x )M(ip y ) - M{ip x ip y ) = iEE {M{lpj tX )M(lpk,y) ~ j, x lfk,y)) ■ 

n j=1 k=1 

Defining Z jk = Miif jtX )Miip k>y ) - Miip jtX i/j k , y ), then 


^ n ^ n n 

Cx >y = “ Zkk ~ E2 


k =1 


j —1 k=1 


which, since M(V>fe, x ) = N k (x)/\W k \ and Miipk, x ipk, y ) = N k ix, y)/\W k \ 2 , is equation (5). 
The substantially lengthier calculation of the variance is postponed until the Appendix. 


(41) 


(42) 


□ 


The infinite sites assumption 


Some of the simplicity above (and more generally in population genetics) relies on the assumption that 
only one mutation can occur at each site, which generally results in expressions that are correct up to a 
factor proportional to the fraction of sites at which more than one mutation has occurred. To illustrate this, 
suppose that more than one mutation can occur per site; in other words, the Poisson process of mutations 
happens on a discrete, not continuous, set. 

Assume for the moment that all mutation rates are equal: nt = I 1 - Fix a pairs of sampled chromosomes 
( 21 , 2 : 2 ), and define £„(£) is defined to be 1 if n mutations have occurred at site i between 21 and 22 , and 
let £ n = (1 /G) 1 £n(0- The expected value of under this model is 


1 -nuU)(^xi^)) n 

G f- n! 

£=1 


(43) 


and so an estimate of n th moment of the empirical distribution of t x (£) could be However, since this 

is essentially estimating the density of sites at which there were n mutations, to have any accuracy requires 
a reasonable number of such sites, and distinguishing these from sequencing error. In practice, this is highly 
problematic, and is confounded by mutation rate heterogeneity. 
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Discussion 


In this paper I have explored the empirical approach to population genetics, by treating the (empirical, 
realized) ancestral recombination graph as a complex, unobserved, object we wish to learn about, rather 
than as an intermediate layer that is averaged over in the course of inferring parameters of interest in 
higher-level stochastic models (e.g., coalescent models). This approach certainly does not replace coalescent 
theory, but seems useful in that it can provide more concretely interpretable results to non-specialists (e.g., 
“numbers of common ancestors” rather than “coalescent rates”), and intuition about what statistics have 
the best power to distinguish between alternative population models. This way of thinking is certainly not 
new, but data that give us power to infer quantities directly at this level of abstraction is. 

Additionally, I have described a new formalism, that can simplify calculations related to population 
genetics statistics by writing these as integrals of functions against a Poisson random measure, which models 
the locations of mutations on the genomes of all possible ancestors. 

This point of view, and this formalism, leads to two general sorts of interpretations for a given statistic: 
first, in terms of the distribution of trees along which the samples are related at each locus; and second, in 
terms of weighted sums over ancestral chromosomes. This duality is easy to see considering simple examples 
such as pairwise divergences, and is less obvious for more complex statistics such as f±. 

fi and family Theorem 1 gives an interpretation of the f '4 statistic as a sum of products of differences in 
ancestry, over all ancestral genomes. Patterson et al. [2012] interprets this and related statistics in terms of 
shared drift along branches of an admixture graph (in which each edge is a randomly mating population). 
Since genetic drift is determined by the sharing of common ancestors, Theorem 1 can be seen as a more precise 
statement of the same observation. Since the main point of this paper is to lay out the general framework, 
I have not undertaken the task of recasting the entire family of related statistics (e.g., ABBA-BABA), but 
the general way forward can be seen by analogy to 

The unknown network of ancestors The problem at hand, to infer aspects of the unknown pedigree 
through which genetic material has been noisily inherited, has some parallels to the problem of active network 
tomography [reviewed in Lawrence et al., 2006]: given information on losses and delays of “probe” packets 
sent between a subset of peripheral nodes in a network, infer the topology of the network and certain 
characteristics of its internal nodes. Results in this field on identifiability [e.g., Singhal and Michailidis, 
2007, Gopalan and Ramasubramanian, 2012] would be very interesting in this context, although genomic 
data are substantially noisier, more sparsely collected, and more numerous. Others have already made use 
of the parallels to phylogenetics, in the other direction [Ni and Tatikonda, 2011]. 

Recombination A glaring omission in this paper is any treatment of linkage between sites: for instance, 
when calculating variances, each site is treated as independent ; surely this cannot be right, as genealogies 
at nearby sites are highly correlated? But, we begin by assuming that the mutation process at each site is 
independent given the genealogies (which seems for the most part reasonable), and take the entire empirical 
ancestral recombination graph as fixed, which includes the locations of ancestral recombination breakpoints. 
If the aim is to estimate descriptive statistics of the ARG then this is the correct point of view; but an 
intermediate position would be to treat only the population pedigree as given, and to additionally model 
randomness in recombination. This could allow for recovery of additional information left behind by recom¬ 
bination, and was the point of view taken in Ralph and Coop [2013]. This will be the subject of a companion 
paper. 

Mutation Above, we have allowed for heterogeneity in mutation rate, as this is known to be substantial 
[e.g., Misawa and Kikuno, 2009]. When applying these methods, it seems necessary to choose regions of the 
genome with comparable large-scale mutation rates, and then within these, to average over enough sites that 
small-scale heterogeneity will not hopelessly confound comparisons. Methods such as described in Lipson 
et al. [2015], may be useful in disentangling mutation rate heterogeneity from heterogeneity in demographic 
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histories along the genome. Furthermore, this paper only models diallelic markers, and completely ignores 
both sequence context and backmutation. This is common in population genetics methods, since the infinite 
sites assumption should be a good one at the typical levels of divergence encountered within species, especially 
if mutation rate heterogeneity in accounted for. Reconciliation of more realistic models of mutation with the 
Poisson process method described here would be difficult. 

Selection An important assumption behind this method is that mutations are independent of the ARG. 
This is not true for alleles under selection, but neither does this approach entirely assume neutrality: since 
the ARG is taken as given, if we knew which sites were under selection and excluded these from the analyses, 
then the assumptions would be satisfied. In other words, linked, selected sites are not an issue, as they act by 
distorting local genealogies. On the other hand, if a large fraction of segregating polymorphisms are actively 
under selection, this violates the model (but it is unclear what to replace it with). 
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The variance of C 


x;y 


Continuation of proof of Theorem 2. First note that (42) implies that 


Cx ’V ~ Z i k ( 

j=1 k =1 ' 


(44) 
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where Sjk = 1 if j = k and is zero otherwise. To find the variance of C XjV , we need to compute covariances 
of the Zjk terms. To do this, it is most efficient to record the generai case. The following lemma follows 
directly from Lemma 3, which uses upper-case roman characters for test functions to make the result more 
visually intuitive and easier to apply. 

Lemma 4. If A, B, C, and D are test functions as in Lemma 3, then 

cov[M(A)M(B), M(C)} = n(ABC) + n{A)ti{BC) + n{B)n{AC) (45) 

and 

co v[M(A)M(B),M(C)M(D)\ = fi(ABCD) + n(A)n{BCD) + ^{B)n(ACD) + n{C)n{ABD) 

+ n{D)n(ABC) + fj.(AC)n(BD) + ^AD)^BC) 

+ n(A)n(C)n(BD) + n(A)n{D)n(BC) ' 1 ’ 

+ fi(AC)n(B)n(D) + fj(AD)n(B)n(C) 

If j ^ k, then ipj yX and ipk,x are supported on disjoint parts of the ancestral genomes, and so = 0. 

For the same reason, the covariance of Zj k and Z( m is zero unless at least one of j and k match at least one 
of I and m. We are further helped by the fact that x = VTx/W" j 

First, we apply Lemma 4 with A = C = tpj,x and B = D = ifk,y and j ^ k (so, note that AB = AD = 
CB = CD = 0). 


y^[z jk ] = {i + l^'KVh,*) + \w k HA,y)} (47) 

Similarly, if A = %pj,x , B = ipk,y, C = ipk,x > and D = ipj,y- and j ^ k (so, note that AB = AC = BD = 
CD = 0). 


CO v[Zj k , Z kj ] = n(lpj, x 1pj,y)lJ’(lpk,y1pk,x) + h{4 , j,x)h(4’j,y)h{' l Pk,y4 l k,x) + h(4>j,x4’j,y)h{4>k,y)h{4>k,x) 
Now, with A = C = ifj yX , B = ipk,y, and C = ipi, y , and j, k and l all distinct, 

cov[Zj-fc, Zj{\ = ^(t)j, I )/((i,X^, ! ,) 

and similarly, 


(48) 


(49) 


CO v[Z kj ,Zji\ = fJ,(lpj, x 1pj, y )fJi(4’k,x)lJ'('>Pl, y )- 
Now, with A = C = ipj,xi B = ipj,yi and D = ipk,y, 


(50) 


COv[M(tp jyX )M(lp jty ),M(lp jtX )M(lp kty )] = ll{lp k ,y) j fJ'(i>j,xi’j,y) ( + MVb>) 


w. 


h(4>j,x)h{lpj,y] 


(51) 


and with A = ipj,xi B = 4>k,yi and C = ipj,xipj,y> 


CO v[M(lp jtX )M(lp kty ),M(lp jtX 1p j!y )\ = yyJJ'{4’k,y)n(4’j,x1pj,y)- 


These combine to give 


CO v[Zjj,Z jk ] = n(lp k ,y) <j H{4’j,x4>j,y)h(.4>j,x) + yy Jl(lpj,x)V'(4’j, V ) 


(52) 


(53) 
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Finally, taking A = C = ipj iX and B = D = ipj, y , we get 


and 


and 


so 


var [M(i) jtX )M(ip jt y)\ 


1 2 

2 j, X 1pj,y) d" (m(V1j,x) + A f (V’i,y)) 


COv[M(lp j>x )M(^ jt y), M(Tp jtX 1p jt y)] = ^2H{lpj, x 1pj,y) + (MV’j.x) + MV’j.l/) 


var[M('0 J> V’i,y)] 


1 

w] 


Ki’j.x^y) 


var[ Zjj] 


var [M{ip itX )M{ij}j t y)] - 2cov[M(-ip jtX )M(ip jty ),M(ip jtX ip jt y)\ + var [M(ip jtX i(> jt y)\ 

L i { 1 Pj,x)l J '(.' l Pj,y) d~ (mW’j,®) + A t (V’j,y)) + 2+ A t (V’j,yV’jtx) 


( 54 ) 


(55) 


(56) 


(57) 

(58) 
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We can put these together to obtain that 


Va,r[C x ,y\ — 2 ^ ^ CO v\Zjk, Zgm\ i^jk j (^£m j 


( 59 ) 


l<j,k/,m<n 

(n — l ) 2 


55 vai \- Z n] ~ "^4 ^ 51 cov[Z j:j ,Z jk + Z kj } + vax i Z jk\ 


3 

+ ^4 5 Z C0V [^'fc’ ^fcj] + Z4 X! C0V [^'fe’ 




( 60 ) 






2 2 

+ ^4 5Z C0v [^fci’%] + r4 X! cov [ z kji Zje] 

^4 5 ' ^A t (V , j,x)A t (V , j,3/) ^ yjjy (/^(Vb'.x) ”1“ A t (V > j,y )) T 2/i(f/ , i 7,yV , j,x)^ T ^i' l Pj,y' i Pj,x) 

_ 2(n_2) (n(lp k ,y) \+ 7^F^j,x)v{4 >j, V ) 


3^k 


+ MV’fc,*) i MVb,yVTxMVb,y) + j^rMVb.J/MVb, 


+ A 55 ( ^MVb>)M^,!/) { 7^7 + MV’j.x) + M(V’fc.y) } ) 

+ ^4 X] (MVb>VTy)MV’fc,yV’fc ) x) + H(^j,x)fJ-(lpj,y)fJ-{i>k,y1pk,x) + ^j,x^j,y)^k,y)^k,x)) 

J^k 

+ — ^^ 3 ,x)^k,y)^l,y) 


n‘ 


55 


j^k^e^j 


T 4 5 ' ^(,' l Pj,x' t Pj,y)^{' t Pk,x)lJ'{' t Pl,y)- 


jjtk^e^j 


( 61 ) 


The above is dominated by terms of the form (/rt) 3 /n|W / |, where fit is an average of ^(^>.) terms. To 
put a crude bound on this, assume that At(Vb>) and fj,{ipjy) are bounded by e for every j, resulting in 
varfC^y] < n | € W | , as in the theorem. 

□ 
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